DOCUMENT RESUME

ED 241 540                                              TM 830 275

AUTHOR       Skaggs, Gary; Lissitz, Robert W.
TITLE        Test Equating: Relevant Issues and a Review of Recent
             Research.
PUB DATE     Mar 82
NOTE         62p.; Paper presented at the Annual Meeting of the
             American Educational Research Association (65th, Los
             Angeles, CA, April 13-17, 1981).
PUB TYPE     Speeches/Conference Papers (150) -- Information
             Analyses (070)

EDRS PRICE   MF01/PC03 Plus Postage.
DESCRIPTORS  Educational Research; *Equated Scores; *Latent Trait
             Theory; Literature Reviews; Standardized Tests;
             Tables (Data); Test Construction; *Testing; Testing
             Problems
IDENTIFIERS  One Parameter Model; *Rasch Model; *Three Parameter
             Model



ABSTRACT 

Equating studies using item response theory (IRT) are
reviewed. The most well-known papers, as well as a sampling of
lesser-known studies, are included. Accompanying tables list the
papers and classify them according to the test used, models used,
test length and type, sample size and type, method of assessment,
equating design, and kinds of comparisons made. A majority of the
equating research has focused on the Rasch, or one parameter
logistic, model. Initial studies using the Rasch model investigated
the invariance properties of the model: person-free item calibration
and item-free person measurement. With tests of similar difficulty
and samples of comparable ability, the research suggests that Rasch
horizontal equating provides reasonable results. Research on other
IRT models has focused on comparing different strategies using the
same data set. With regard to vertical equating, most of the research
has demonstrated the superiority of the three-parameter model over
the Rasch model. Additional equating studies using Monte Carlo
methods are reviewed. Finally, four issues relevant to test equating
are discussed: assessing the adequacy of equating, sources of
equating error, multidimensionality, and out-of-level testing.
(PN)



***********************************************************************
*   Reproductions supplied by EDRS are the best that can be made     *
*                    from the original document.                     *
***********************************************************************






TEST EQUATING: RELEVANT ISSUES
AND A REVIEW OF RECENT RESEARCH

Gary Skaggs & Robert W. Lissitz

University of Maryland

Paper presented at the Annual Meeting
of the American Educational Research Association

Los Angeles, 1981






TEST EQUATING: RELEVANT ISSUES AND A REVIEW OF RECENT RESEARCH



One of the most important problems in measurement is to find
a precise means of comparing different assessment procedures.
Two such procedures, or tests, may be intended to measure the
same ability. Both may be deemed to be useful in terms of their
concurrent or predictive validity. However, an important
question is, how can scores on one instrument be compared to
scores on another instrument? That is, can the scores on the two
instruments be linked in any meaningful way?

This issue manifests itself in many measurement
applications. One of these is concerned with the development of
an interconnected series of tests. This can be in one of two
forms. Tests can be interchangeable, alternate forms designed to
have identical psychometric properties. Or, tests can indicate
varying degrees of intensity of a trait so that the tests can
measure a single dimension across a wide range of ability. The
first situation reflects horizontal test equating, and the second
reflects vertical test equating.

More formal definitions of test equating have been proposed.
According to Angoff (1971, p. 562), to equate two tests is "to
convert the system of units of one form to the system of units of
the other, so that scores derived from the two forms after
conversion will be directly equivalent." Scores will be
equivalent, therefore, if they have the same percentile ranks on
two tests.
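
The percentile-rank definition above can be illustrated with a minimal sketch. It assumes integer raw scores from two randomly equivalent groups, uses the mid-percentile-rank convention, and interpolates linearly between adjacent form-Y scores. The function names and the small data sets are hypothetical and are not taken from any of the studies reviewed here.

```python
def percentile_rank(scores, x):
    """Percent of scores below x, plus half of the scores equal to x."""
    below = sum(1 for s in scores if s < x)
    equal = sum(1 for s in scores if s == x)
    return 100.0 * (below + 0.5 * equal) / len(scores)


def equipercentile_equate(x, scores_x, scores_y):
    """Form-Y score with the same percentile rank as score x on form X."""
    target = percentile_rank(scores_x, x)
    lo, hi = min(scores_y), max(scores_y)
    prev_y, prev_pr = lo, percentile_rank(scores_y, lo)
    if target <= prev_pr:
        return float(lo)
    for y in range(lo + 1, hi + 1):
        pr = percentile_rank(scores_y, y)
        if pr >= target:
            # Interpolate linearly between the two bracketing Y scores.
            return prev_y + (target - prev_pr) * (y - prev_y) / (pr - prev_pr)
        prev_y, prev_pr = y, pr
    return float(hi)


# Hypothetical data: form Y is uniformly two points easier than form X.
form_x = [1, 2, 3, 4, 5]
form_y = [3, 4, 5, 6, 7]
equated = equipercentile_equate(3, form_x, form_y)
```

In operational work the score distributions are usually smoothed before equating; that step is omitted here.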






More recently, Lord (1977, 1980) has incorporated the notion
of equity into a definition of equating tests. Equity is met
when "it is a matter of indifference to applicants at every given
ability level whether they are to take test x or test y." Several
important requirements are implicit in this definition of test
equating. First, equating makes sense only if two tests measure
the same ability. Secondly, the equating should be the same
regardless of which test is equated to the other. Third, the
equating should be the same regardless of the population from
which it is conducted. Finally, as shown in Theorem 13.3.1
(Lord, 1980, p. 198), tests cannot be strictly equated unless the
tests are equally reliable or perfectly parallel.

In practice, of course, scores are not perfectly reliable,
and the above conditions are rarely met. A less rigorous
definition has been used in connection with developing
statistically equivalent tests. Under this definition, two tests
are equated if examinees of equal ability would be expected to
obtain the same score on each test. This has been referred to by
Kolen (1981) as the definition of equating for non-parallel tests
and by Whitely & Dawis (1974) as an equating of tau-equivalent
measures.

Research into methods of equating tests has been an ongoing
process for the better part of three decades. In the past
decade, however, there has been an upsurge of interest due to the
application of item response theory (IRT) methods to test
equating. The focus of this inquiry has been on the development
of new equating techniques and comparing their effectiveness with
that of traditional approaches. While representation in such
journals as Journal of Educational Measurement and Applied
Psychological Measurement has increased, even more papers have
appeared at the meetings of many professional organizations, such
as the American Educational Research Association, National
Council on Measurement in Education, and the Minnesota
Computerized Adaptive Testing Conferences.

This paper will look at the two major types of equating,
horizontal and vertical, in terms of the kinds of questions
test users are asking. In horizontal equating, the task is to
provide test scores which are directly comparable across test
forms designed to have similar psychometric properties. The
major use for horizontal equating has usually been in developing
alternate forms of standardized tests, such as the SAT, GRE, and
ITBS. One may also ask about the feasibility of equating scores
across different tests, for example, in obtaining CTBS
equivalents from the ITBS.

For vertical equating, the problem is much more complex.
The basic goal is to develop scores that will link on a single
dimension several tests of intentionally different difficulties
designed for groups of different abilities. It would be very
convenient to have a way of comparing scores for examinees who
took tests of unequal difficulty. A major application of this
arises in out-of-level testing, where an examinee takes a level
of a test that is appropriate to their ability level but which is
different from the average for their group. A major focus of
this review will be to see how far the present research has
progressed toward providing reasonable conclusions for those
wishing to use IRT methods.



METHODS OF EQUATING SCORES 



A complete understanding of test equating research requires
familiarity with so-called traditional methods of equating and
with several aspects of latent trait theory. These aspects
include theoretical models, parameter estimation procedures, and
equating techniques. It is beyond the scope of this paper to
provide a survey of all these topics. They have been covered in
detail in other sources. In this section, some of these
references will be provided. The authors will assume that the
reader is familiar with at least some of these references.

There are two major types of traditional equating methods:
linear and equipercentile. Other types have been proposed, but
the linear and equipercentile approaches have been the most
commonly used, and recent research has focused almost exclusively
on them. An extended treatment of these methods has been
provided by Angoff (1971). That discussion summarizes
theoretical distinctions, equating designs, methods for equally
and unequally reliable tests, and standard errors.
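
As a point of reference for the linear approach, the following minimal sketch sets standardized deviation scores on the two forms equal, so that a form-X score x is converted to mean_y + (sd_y / sd_x) * (x - mean_x). It assumes a simple random-groups design with no anchor test; the function name and the data are hypothetical, and the anchor-test and unequal-reliability variants treated by Angoff (1971) are not shown.

```python
from statistics import mean, pstdev


def linear_equate(x, scores_x, scores_y):
    """Convert a form-X score to the form-Y scale by matching means and SDs."""
    mean_x, mean_y = mean(scores_x), mean(scores_y)
    sd_x, sd_y = pstdev(scores_x), pstdev(scores_y)
    return mean_y + (sd_y / sd_x) * (x - mean_x)


# Hypothetical data: form Y has a higher mean and a larger spread.
form_x = [10, 20, 30]
form_y = [15, 30, 45]
equated = linear_equate(30, form_x, form_y)
```

Unlike the equipercentile approach, this conversion is a straight line, so it matches only the first two moments of the score distributions.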

When using item response theory to equate tests, one must
first decide on the latent trait model that best fits the data.
The most commonly used models are the Rasch, or one parameter
logistic, model (Rasch, 1960) and the two and three parameter
models (Birnbaum, 1968). A summary of latent trait models can be
found in Hambleton & Cook (1977) and Hambleton, Swaminathan,
Cook, Eignor, & Gifford (1978).
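
For readers who want the models side by side, the sketch below writes the three logistic item response functions as special cases of one expression. The parameter names (a for discrimination, b for difficulty, c for the lower asymptote) follow common usage; the scaling constant is left as an argument with a default of 1.0, which is an assumption of this sketch, since some presentations use 1.7.

```python
import math


def three_pl(theta, a, b, c, scale=1.0):
    """Probability of a correct response under the three parameter model."""
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))


def two_pl(theta, a, b, scale=1.0):
    """Two parameter model: no guessing, so the lower asymptote is zero."""
    return three_pl(theta, a, b, 0.0, scale)


def rasch(theta, b):
    """Rasch (one parameter) model: equal discrimination, no guessing."""
    return three_pl(theta, 1.0, b, 0.0)


# An examinee whose ability equals the item difficulty.
p_rasch = rasch(0.0, 0.0)
p_guess = three_pl(0.0, 1.0, 0.0, 0.2)
```

The nesting makes the later comparisons easier to read: whenever guessing or unequal discrimination is present in the data, the Rasch function is being asked to fit responses that the fuller model can represent directly.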

Once the model is specified, item and person parameters need
to be estimated. There is a great deal of literature concerning
different estimation procedures. For test equating studies,
methods based on maximum likelihood have been used almost
exclusively. For the Rasch model, unconditional estimation
procedures by Wright & Panchapakesan (1969) and the BICAL program
of Wright & Mead (1976) have been used frequently. Conditional
procedures have been developed (see Gustafsson, 1980), but they
have been used only rarely in test equating research. For other
latent trait models, the LOGIST program (Wood, Wingersky, & Lord,
1976) has been by far the most frequently utilized program.

When parameter estimates are obtained, test scores can then
be placed on the same scale. This scale can be the ability
scale, standard score scale, or raw score scale. Procedures for
accomplishing this linking have been discussed by Lord
(1977, 1980), Marco (1977), and Wright (1977). These are based
primarily on a linear transformation of the ability scale.
Different approaches have been developed recently, and these will
be mentioned later. All IRT equating is based on raw scores.
Lord (1980) describes two methods for obtaining raw scores that
can be used for equating: estimated true scores and estimated
observed scores. Almost all research has utilized the former
method.
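
One simple way to obtain a linear transformation of the ability scale is sketched below. It assumes that a set of common items has been calibrated separately in two runs and chooses the slope and intercept so that the common-item difficulties have the same mean and standard deviation on both scales. This mean-and-sigma approach is offered only as an illustration; the function names are hypothetical, and the procedures in the sources cited above differ in detail. Setting fix_slope=True gives the shift-only link appropriate when both calibrations use the Rasch model.

```python
from statistics import mean, pstdev


def linking_constants(b_from, b_to, fix_slope=False):
    """Slope and intercept carrying the 'from' scale onto the 'to' scale."""
    slope = 1.0 if fix_slope else pstdev(b_to) / pstdev(b_from)
    intercept = mean(b_to) - slope * mean(b_from)
    return slope, intercept


def rescale(value, slope, intercept):
    """Apply the linear transformation to an ability or a difficulty."""
    return slope * value + intercept


# Hypothetical difficulties for three common items in two calibrations.
b_run_1 = [-1.0, 0.0, 1.0]
b_run_2 = [0.0, 2.0, 4.0]
slope, intercept = linking_constants(b_run_1, b_run_2)
theta_on_run_2_scale = rescale(0.5, slope, intercept)
```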

Because of the volume of recent research, it would be
impossible to review every paper. We have attempted here to
include the most well-known papers as well as a fair sampling of
lesser known studies. Any omissions are not intended to reflect
on the quality of the papers not cited. Surely, by the end of
this conference, a number of additional papers will have been
presented, and this paper may to some extent become dated. A
conclusion that will become apparent in this review is that the
field of IRT equating is in midstream and has a long way to go
toward providing some definitive answers about these methods.

As an aid to summarizing the following studies, tables have
been prepared. These list the papers and classify them according
to the test used, models used, test length and type, sample size
and type, method of assessment, equating design, and kinds of
comparisons made. These are intended to aid the reader in
referring quickly to some of the relevant dimensions of the
studies.

THE RASCH MODEL

Not surprisingly, a majority of the research on test
equating has focused on the Rasch or one parameter logistic
model. The simplest of the latent trait models, it provides
several advantages over other IRT models. Probably the most
important of these is that the raw score is a sufficient
statistic for estimating ability. Initial studies using the
Rasch model investigated the so-called invariance properties of
the model in one of two ways. First, two samples are
administered the same set of items and the two sets of item
difficulty estimates compared. This is an example of person-free
item calibration. Secondly, two sets of items are given to the
same sample and the two sets of ability estimates compared. This
latter situation is referred to as item-free person measurement
and reflects a design that could be applied to test equating.
The following studies are summarized in Table 1.
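
The sufficiency of the raw score can be made concrete with a short sketch. Given item difficulties, the maximum likelihood ability estimate under the Rasch model is the value of theta at which the expected number correct equals the observed raw score, so it depends on the responses only through that score. The Newton-Raphson routine below is a minimal illustration with hypothetical item difficulties; it treats the difficulties as known and omits the corrections applied by programs such as BICAL.

```python
import math


def rasch_ability(raw_score, difficulties, tol=1e-8, max_iter=100):
    """ML ability estimate for a raw score on items of known difficulty."""
    n_items = len(difficulties)
    if not 0 < raw_score < n_items:
        raise ValueError("zero and perfect scores have no finite estimate")
    theta = math.log(raw_score / (n_items - raw_score))
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        expected = sum(probs)
        information = sum(p * (1.0 - p) for p in probs)
        step = (raw_score - expected) / information
        theta += step
        if abs(step) < tol:
            break
    return theta


# Hypothetical four-item test with difficulties symmetric about zero.
items = [-1.0, -0.5, 0.5, 1.0]
theta_2 = rasch_ability(2, items)
theta_3 = rasch_ability(3, items)
```

Because every examinee with the same raw score receives the same estimate, comparing ability estimates across samples at fixed raw scores amounts to comparing two score-to-ability conversion tables.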

In his initial operationalization of the Rasch model, Wright
(1968) illustrated both aspects of invariance with item response
data from the Law School Admissions Test (LSAT). For sample-free
item calibration, comparing high and low ability samples, Wright
found that test calibration curves based on the two sets of
ability estimates were very close together. With the same set of
data, easy and difficult subtests were formed and ability
estimates were obtained for each examinee, thus assessing
item-free person measurement. Wright developed a "standardized
difference score" for each individual and claimed that if only
random error were present such differences would have a mean of
zero and a standard deviation of one. In this study, obtained
values were very close to zero and one, respectively. In this way,
support for both types of invariance was found. Similar evidence
was found by Anderson, Kearney, & Everett (1968) who, for two
samples of nearly equal ability, obtained a correlation of .96
between the two sets of item difficulty estimates. These early
studies therefore provided some evidence supporting the
invariance claims of the Rasch model.
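
A standardized difference of this kind can be sketched as follows. The sketch assumes the usual form, in which the difference between an examinee's two ability estimates is divided by the square root of the sum of their squared standard errors; this is an assumption about the statistic rather than a reproduction of Wright's computations, and the function names and values are hypothetical.

```python
import math
from statistics import mean, pstdev


def standardized_difference(theta_1, se_1, theta_2, se_2):
    """Difference between two ability estimates in standard error units."""
    return (theta_1 - theta_2) / math.sqrt(se_1 ** 2 + se_2 ** 2)


def summarize(differences):
    """Mean and standard deviation of the standardized differences."""
    return mean(differences), pstdev(differences)


# Hypothetical examinee: estimates from an easy and a hard subtest.
z = standardized_difference(1.0, 0.3, 0.5, 0.4)
```

If the two subtests measure the same ability and only random error is present, the values returned by summarize should be near zero and one. A standard deviation well above one signals more disagreement than measurement error alone would produce.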

In a similar study, Tinsley & Dawis (1975) looked at four
types of analogies tests and samples from four very different
populations. Ten comparisons were made between pairs of samples
on the same test, and sets of difficulty and ability estimates
were correlated. The results did not follow a clear pattern.
Correlations between sets of difficulty estimates ranged from
-.08 to .98. Generally, lower correlations seemed to occur
between the most dissimilar samples. Higher correlations tended
to occur with larger sample sizes and longer tests. However,
some notable exceptions made these generalizations very
tentative. In particular, this study dealt with some very small
samples and short tests. In this study, ability estimates
between identical raw scores were correlated for pairs of
samples. In all cases, this correlation was .999, even though
item difficulties correlated as low as -.08. As noted by Whitely
(1977) and Divgi (1981), test and item calibration are relatively
independent concerns.

In both the Anderson and Tinsley papers, an attempt was made
to assess the effect of removing items that did not fit the Rasch
model. In the Anderson study, this resulted in an increase in
the correlation between item estimates. In the Tinsley & Dawis
study, changes in the correlation coefficients were inconsistent
as misfitting items were removed. Some correlations decreased,
and some increased and then decreased. In fact, one correlation
decreased from .88 to .18 as misfitting items were removed. Of
course, this shortened the test considerably and probably made
calibration less reliable. Clearly, an important variable in the
equating outcome is the model upon which the test is constructed.
Unfortunately, this topic has been largely neglected in the
equating literature (see Cook & Eignor, 1981).

One of the first studies to compare different sets of items
was conducted by Whitely & Dawis (1974). These investigators
divided a 60-item verbal analogies test into two 30-item subtests
in three different ways: odd and even items, easy and hard items,
and random subsets of items. They assessed their results in
terms of the two ability estimates for each examinee, one from
each item set. Wright's standardized difference statistic was
used to summarize the results. For the odd-even and random sets
comparisons, the means and standard deviations were very close to
zero and one, respectively. However, in comparing easy and
difficult items, the variance of the standardized difference
scores was significantly greater than one. The authors
suggested that poor fit to the Rasch model may have influenced
this latter result. It turned out that 57% of the hard items and
23% of the easy items did not fit the model (40% for the entire
test). Reasons for the misfit were not studied. Presumably,
more guessing occurred on the harder test. At any rate, the
authors concluded that the Rasch model was to some extent
invariant for this set of data even though the items were fairly
deviant from the Rasch model.

While the above studies investigated situations relevant to
test equating, they were not in fact equating studies since no
sets of items were actually linked together. In 1974, the final
report of the Anchor Test Study (ATS) was published by Loret,
Seder, Bianchini, & Vale. This was a large-scale equating of
several forms and levels of seven published reading test
batteries. Linear and equipercentile approaches were used to
link raw scores on these tests. While item response theory was
not used in this study, it did provide the data used in several
later studies.

The first of these involved an application of Rasch
procedures to the ATS data by Rentz & Bashaw (1977). Rentz &
Bashaw developed the National Reference Scale (NRS) for reading.
This scale provided direct raw score comparisons for 28
test/level combinations. This, in effect, treated the 2,644
items on all tests as a calibrated item pool, any subset of which
could produce a score on the NRS. Assessment of the adequacy of
the equating process was done by looking at the variability of
ability estimates across repeated administrations of the same
test. This was accomplished because each test was administered
to somewhere between 14 and 28 samples. The standard deviation
for each raw score ability level was computed and averaged for
all raw score groups to provide a single index for each
test/level combination. These ranged from .008 to .041 logits.
Relative to the ability scale itself and to the standard error of
an individual's ability estimate, these values were quite small.
Therefore, the authors concluded that there was sufficient
invariance to justify the Rasch equating.
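
An index of this general kind can be sketched in a few lines. The sketch assumes that, for one test/level combination, the ability estimate attached to each raw score is available from every calibration sample; it takes the standard deviation across samples at each raw score and averages those values. Whether the original computation used the population or the sample form of the standard deviation is not stated above, so the population form here is an assumption, and the data are hypothetical.

```python
from statistics import mean, pstdev


def invariance_index(estimates_by_raw_score):
    """Average, over raw scores, of the SD of ability estimates across samples."""
    return mean(pstdev(values) for values in estimates_by_raw_score.values())


# Hypothetical logit estimates for two raw scores from two samples each.
estimates = {
    10: [0.00, 0.02],
    11: [0.50, 0.50],
}
index = invariance_index(estimates)
```

Values of this size can then be judged against the standard error of an individual ability estimate, as the authors did.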

Up to this point, the research has shown some support for
the validity of the Rasch model. With tests of similar
difficulty and samples of comparable ability, Rasch horizontal
equating seems to provide reasonable results. To be sure, this
provides two major improvements over traditional methods because:
1) item subsets can be tailored to specific groups and 2)
statistically equivalent forms can be developed despite
unintended differences in difficulty. On the other hand, these
studies do raise some questions about the limits of Rasch
invariance. The Whitely & Dawis study suggests that ability
estimates were not quite as invariant when subtests were
deliberately different in difficulty. Tinsley & Dawis point to
the potential problem of small samples and samples which are
widely different from one another.

Some of the above difficulties have been investigated
through vertical equating where tests of unequal difficulty and
samples of unequal ability were employed. In an analysis of the
Anchor Test Study data, Slinde & Linn (1977) found that vertical
equating using the equipercentile approach resulted in large
discrepancies in grade-equivalent and scaled scores for the same
examinees on different levels of a published test. The authors
suggest using latent trait models for vertical equating.

Slinde & Linn (1978) conducted a vertical equating
investigation with the Rasch model that was both a replication
and an extension of one part of the Whitely & Dawis and Wright
(1968) studies. Whitely & Dawis noted that the variance of
standardized difference scores on easy and difficult subtests was
slightly larger than would be expected with purely random
measurement error. Slinde & Linn replicated this situation. In
addition, they investigated the stability of equating when item
difficulty estimates were obtained from one sample and then
applied to a different sample. In practical vertical equating
studies, these two samples may differ widely in ability.

In the Slinde & Linn study, a 36-item achievement test was
divided into easy and difficult subtests. A sample of 1,307
incoming college freshmen was divided into high, medium, and low
ability groups based on their performance on the easy subtest.
Only the high and low groups were used for item and ability
estimates. It turned out that the two subtests yielded similar
results for a group (using Wright's standardized difference
index) when that same group was used in the test calibration.
This corroborated the results of Whitely & Dawis and Wright.
However, this did not occur when a group other than the one to
whom the results are applied was involved in the parameter
estimation process. Substantially different ability estimates
resulted from the two subtests, as much as 1.2 log ability units
(logits). In other words, an examinee's equated score on
different levels of a test varied depending on the ability level
of the sample on which the equating was based. This was a clear
violation of Rasch model invariance.

To be sure, the comparisons involved in this study were
severe. The easy and difficult subtests differed by almost two
raw score standard deviations. The high and low ability groups
differed by about 1.8 logits. Nevertheless, some limits to Rasch
invariance were demonstrated that cast doubts on its use in
vertical equating.

Gustafsson (1979) criticized Slinde & Linn for dividing
their sample into ability groups based on performance on the easy
subtest of the same test used for equating. Gustafsson showed
through a simulation that a spurious lack of model fit could be
introduced in such a situation due to a regression artifact.

Slinde & Linn (1979) conducted a reanalysis with a set of
data drawn from the Anchor Test Study, using fifth graders.
Three ability groups were formed based on data from the
California Tests of Basic Skills Reading Comprehension subtest,
but analysis was carried out on the SRA Reading Test. The
procedures were the same as before except calibration was also
done for the middle ability group, providing an additional
comparison. The results generally supported the earlier study.
Moreover, widely different ability estimates were obtained
whenever the low group was used either for calibration or
comparison. Results involving only the middle and high groups
showed comparable ability estimates from the two subtests. This
led the authors to conclude that guessing played a role in the
poor results, a conclusion shared by Gustafsson (1979). The
authors noted correlations of -.68 and -.38 between item
difficulty and discrimination indices for the low and middle
groups, respectively. Such a negative correlation is indicative
of failure to estimate non-zero lower asymptotes. This would
imply that a more fully-parameterized model would have provided
better results. It also implies that Rasch vertical equating
might give better results in situations where guessing is
minimized.

Several researchers have examined the viability of Rasch
vertical equating when differences between item sets and ability
groups were less extreme. Loyd & Hoover (1980) used three
levels of the Iowa Tests of Basic Skills (ITBS) Math Computations
Test and three samples of pupils from the sixth through eighth
grades. Equatings were conducted across adjacent and
non-adjacent levels using the three samples as separate
calibration groups. Their results supported the Slinde & Linn
studies in that the equating between any two levels was
influenced by the group upon which the equating was based. There
was no definite trend, except that perhaps, as in Slinde & Linn,
an examinee would receive a higher ability estimate if he/she
took the test at the level of the calibration group.

In looking for causes of the inadequate Rasch equatings,
Loyd & Hoover were concerned that curriculum content across grade
levels, particularly in mathematics, might not represent a
unidimensional scale. To investigate this, they performed a
principal axis factor analysis of the total item set. The
analysis showed that more than one factor was present in the
total item pool. Since unidimensionality is a basic assumption
for IRT models, these results raise questions about the use of
such models for certain tests. On the other hand,
unidimensionality is implicit in test equating in general. For
two tests to be equated in a meaningful way, they must measure
the same trait. It may be that certain types of tests, such as
curriculum-based tests, cannot be equated because what they
measure changes from level to level.

Finally, Guskey (1981) recently conducted a Rasch vertical
equating with the ITBS Reading Comprehension Test, levels 9-14.
These tests and levels were also used by Rentz & Bashaw (1977).
Adjacent levels were equated, using only one calibration group at
each level. Thus, issues raised by Slinde & Linn and Loyd &
Hoover concerning cross-validation were not addressed here.
However, Guskey compared the Rasch ability scale to the
publisher's grade-equivalent scale for raw scores on each of the
test levels. The two scales, not surprisingly, differed widely
at extreme ability levels. Moreover, in one area near the middle
of the ability scale, the three lower levels (9-11) of the test
unexpectedly produced lower Rasch ability estimates than the
three higher levels (12-14). Additional evidence suggested that
the Rasch estimates were more indicative of the actual abilities
of the examinees than grade-equivalent scores. While these
results do not challenge sample invariance, they do suggest a
clear improvement over publisher's grade-equivalent norms. On
the other hand, Guskey minimized guessing by using only high
ability examinees. The data therefore probably fit the Rasch
model reasonably well.



THE THREE PARAMETER LOGISTIC AND OTHER MODELS



Following research on the Rasch model, several researchers
have quite naturally investigated the application of other latent
trait models, most notably the three parameter logistic model, to
test equating. This research has focused on comparing different
strategies using the same data set, the most frequent comparison
being the three parameter versus the one parameter logistic
(Rasch) model. A large portion of this work has been conducted
at Educational Testing Service under the direction of Dr.
Frederic Lord. Perhaps because these papers tend to be more
complex and larger in scope than those for the Rasch model alone,
they are not as numerous. The studies that are discussed in this
section have been summarized in Table 2.

One of the first comparative studies of this sort was done
by Marco, Petersen, & Stewart (1979). This was a very large
study, so it will be discussed in some detail. Forty linear, two
equipercentile, and the one and three parameter logistic models
were examined under a variety of conditions including random and
dissimilar samples, internal and external anchor tests, and
different types of criterion scores and summary statistics. For
any single comparison, only the best (least error) linear model
was presented. Generally, there were two distinct studies in
this project: one, in which a test was equated to itself
(horizontal equating) and two, in which tests of unequal
difficulty were equated (vertical equating). In all situations,
an anchor test design was used where one form of a total test was
administered to one group of examinees, a second form given to a
second group of examinees, and a common anchor test given to both
groups. For evaluating the adequacy of the equating, two
statistics were developed. Total error, or mean square error,
was defined as a sum of squared differences (weighted) between
equivalent scores. Secondly, squared bias was defined as the
mean squared difference between equated scores. It is easily
demonstrated that squared bias is a part of the total error.



TABLE 1
RASCH MODEL STUDIES

PAPER                    TESTS                     ITEMS                           SAMPLES
                                                   (number = k)                    (number = n)

Wright (1968)            LSAT                      k=48                            law students; n=976

Anderson et al. (1968)   Intelligence screening                                    Australian armed forces;
                         test                                                      n1=608; n2=874

Whitely & Dawis (1974)   unpublished verbal        k=60; 30 items per subtest      college and high school
                         analogies test                                            students; n=949

Tinsley & Dawis (1975)   unpublished verbal        4 tests: picture, word,         4 samples: high school &
                         analogies                 symbol, numbers;                college students, Voc.
                                                   k=23-60 per test                Rehab. clients, civil
                                                                                   service employees

Rentz & Bashaw (1977)    CAT, CTBS, ITBS, MAT,     2 forms of each test;           4th thru 6th graders;
                         STEP II, SRA, SAT:        k=59-121 per test;              n=1300-2000 per sample;
                         see Anchor Test Study     k=2,644 total                   42 total samples

Slinde & Linn (1978)     CEEB Math Achievement     k=36; 18 per subtest            incoming college freshmen;
                         Test Level I                                              n=1,307 total; 3 subsamples

Slinde & Linn (1979)     SRA, Blue Level, from     comprehension and vocabulary;   5th graders; n=1,638;
                         Anchor Test Study         k=48, 12 (respectively)         3 subsamples

Loyd & Hoover (1980)     ITBS Math Comprehension;  k=45 per level; 30 item         6th thru 8th graders;
                         Levels 12-14              overlap between adj. levels     n=1,956

Guskey (1981)            ITBS Reading              k=60-80 per level; 33-58 item   6th thru [illegible] graders
                         Comprehension;            overlap between adj. levels     of high ability;
                         Levels 9-14                                               n=[illegible]; 1 per level




TABLE 1 (contd.)

Columns: PAPER; TYPE OF EQUATING; METHOD OF ASSESSMENT;
COMPARISONS MADE


Wright (1968)
   Type of equating:     horizontal & vertical
   Method of assessment: standardized difference; comparison of
                         test calibration curves
   Comparisons made:     ability estimates from easy & hard
                         subtests; difficulty estimates from
                         "smart" & "dumb" samples

Anderson et al. (1968)
   Type of equating:     samples of equal ability
   Method of assessment: correlations between sets of estimates
   Comparisons made:     difficulty estimates from two samples

Whitely & Dawis (1974)
   Type of equating:     horizontal & vertical
   Method of assessment: standard difference scores
   Comparisons made:     ability estimates based on easy vs.
                         hard; odd vs. even; random subsets of
                         items

Tinsley & Dawis (1975)
   Type of equating:     samples of different ability
   Method of assessment: correlations between sets of ability &
                         difficulty estimates
   Comparisons made:     10 comparisons: pairs of samples
                         responding to the same test

Rentz & Bashaw (1977)
   Type of equating:     horizontal & vertical
   Method of assessment: standard deviation of parameter
                         estimates across items & persons
   Comparisons made:     all test forms placed on the Rasch
                         ability scale

Slinde & Linn (1978)
   Type of equating:     vertical
   Method of assessment: standardized differences
   Comparisons made:     cross-validation of ability estimates

Slinde & Linn (1979)
   Type of equating:     vertical
   Method of assessment: standardized differences
   Comparisons made:     cross-validation of ability estimates

Loyd & Hoover (1980)
   Type of equating:     vertical
   Method of assessment: graphic presentation
   Comparisons made:     cross-validation of ability estimates

Guskey (1981)
   Type of equating:     vertical
   Method of assessment: correlations and mean differences
                         between Rasch and G-E scores
   Comparisons made:     comparison of Rasch ability scale with
                         publisher's grade-equivalents




TABLE 2

THREE PARAMETER AND OTHER MODELS

Columns: PAPER; TESTS; MODELS; ITEMS (number=k);
SAMPLES (number=n)


Marco, Petersen, & Stewart (1979)
   Tests:   SAT, Verbal
   Models:  Rasch, three parameter, two equipercentile,
            forty linear methods
   Items:   antonym, analogy, sentence completion, reading
            comprehension
            k=[illegible],85 (total test)
            k=20,[illegible] (anchor)
   Samples: high school students (mostly)
            2 samples/10 subsamples
            n=[illegible] per subsample
            random and dissimilar subsamples

Kolen (1981)
   Tests:   ITED, 6th & 7th ed.
   Models:  linear, equipercentile, one, two, and three
            parameter models
   Items:   vocabulary, quantitative
            k=40 (vocab.)
            k=[illegible] (quant.)
            7th ed. has 2 levels
   Samples: 9th thru 12th graders
            n=1,579-1,925 per grade/form combination

Petersen, Cook, & Stocking (1981)
   Tests:   SAT, Verbal & Math
   Models:  [illegible] linear models, equipercentile, three
            parameter model: partial pre-calibration &
            concurrent calibration
   Items:   Verbal (see above)
            Math: math, data [illegible]
            k=60
   Samples: high school students (mostly)
            n=[illegible] per sample
            5 samples

Cook, Dunbar, & Eignor (1981)
   Tests:   SAT/PSAT-Nat. Merit
   Models:  2 linear methods, equipercentile, three parameter
            model
   Items:   Verbal and Math (see above)
            PSAT: k=65 (Verbal)
                  k=50 (Math)
   Samples: mostly high school students
            n=2,000 per sample
            3 samples






TABLE 2 (contd.)

Columns: PAPER; TYPE OF EQUATING AND EQUATING DESIGN;
METHODS OF ASSESSMENT


Marco, Petersen, & Stewart (1979)
   Type of equating and equating design:
      A) Horizontal equating: equating a test to itself
         through anchor tests:
         a) internal anchor tests
         b) external anchor tests
         c) two internal anchors differing in difficulty from
            total test
      B) Vertical equating: total tests of different
         difficulty:
         a) through anchor of intermediate difficulty
         b) two total tests equated directly
   Methods of assessment:
      A) Criterion score: test score itself
      B) Criterion scores calculated in two ways:
         a) IRT equipercentile
         b) direct equipercentile
      For both A and B, two summary statistics calculated:
         a) mean square error (total error)
         b) squared bias (mean difference squared)

Kolen (1981)
   Type of equating and equating design:
      A) Horizontal equating: 6th ed. & 7th ed. (Level II)
      B) Vertical equating: 6th ed. & 7th ed. (Level I)
      Test forms randomly assigned to examinees; therefore,
      randomly equivalent groups taking each test
   Methods of assessment:
      Cross-validation statistic: total mean square error
      applied to an independent sample
      Original total test score used as criterion

Petersen, Cook, & Stocking
   Type of equating and equating design:
      External anchor test, scale drift: test equated to
      itself through five intervening forms, each with a
      separate anchor test
   Methods of assessment:
      Summary statistics: mean square error and squared bias
      (see above)

Cook, Dunbar, & Eignor (1981)
   Type of equating and equating design:
      Internal anchor test; old SAT and new PSAT have items
      in common
   Methods of assessment:
      Mean square error and squared bias (see above);
      three parameter model scores used as criterion






TABLE 2 (contd.)

Columns: PAPER; TESTS; MODELS; ITEMS (number=k);
SAMPLES (number=n)


Cowell (1981)
   Tests:   TOEFL, [illegible] forms; SLEP
   Models:  modified and full three parameter, Rasch, and
            linear
   Items:   Written Expression, Reading Comprehension,
            Listening
            k=40,65,50 (resp.)
   Samples: large and small samples
            n=[illegible] (large)
            n=292-317 (small)

Kolen & Whitney (1981)
   Tests:   General Educational Development (GED), 12 forms
   Models:  [illegible], Rasch, three parameter
   Items:   [illegible], science, social studies, reading, math
            k=30,60,60,40,50 (resp.)
   Samples: adults, high school equivalency
            n=205 per form
            n=20 in cross-validation

Patience (1981)
   Tests:   ITED, Expression
   Models:  Rasch, two, and three parameter models,
            equipercentile
   Items:   Corrections, Spelling
            k=49,14 (resp.)
            k=6 (anchor)
            total test divided into easy, medium, and hard
            subtests of 23 items each
   Samples: 9th thru 12th graders
            n=1,000 per grade






TABLE 2 (contd.)

Columns: PAPER; TYPE OF EQUATING AND EQUATING DESIGN;
METHODS OF ASSESSMENT


Cowell (1981)
   Type of equating and equating design:
      Internal anchor test: equating alternate forms with
      common items
      Comparisons made between pairs of models and sample
      sizes
   Methods of assessment:
      Summary statistics: mean squared difference, mean
      absolute difference, squared bias, maximum absolute
      difference, variance of differences

Kolen & Whitney (1981)
   Type of equating and equating design:
      Horizontal equating; each examinee given one test form
      and anchor test
   Methods of assessment:
      Squared bias and imprecision indices applied to
      cross-validation sample
      ANOVA: forms X methods
      Anchor test score is used as the criterion score

Patience (1981)
   Type of equating and equating design:
      Vertical equating; calibration based on 12th grade
      sample. Easy, medium, and hard subtests are equated
      with 9th thru 11th grades, respectively
   Methods of assessment:
      Correlation between derived and obtained scores
      Score obtained from original data set used as criterion
      score (originally, all examinees responded to all
      items)







For the horizontal equating part of the study, Marco et al.
found that, when the anchor test was equal in difficulty to the
two total tests, the linear and IRT methods performed well. With
an internal anchor, the equipercentile approach also worked well.
With an external anchor, the Rasch model did slightly better.
With a parallel anchor test, the type of sample mattered very
little. When the anchor test was easier or more difficult than
the total tests, random samples showed very little error. On the
other hand, the IRT models were vastly superior to the
traditional methods with dissimilar samples (samples of unequal
ability). Neither IRT model was clearly superior to the other.

When tests of unequal difficulty were equated, the best
linear method displayed large total errors, followed by the Rasch
model. The three parameter model performed with the least amount
of error when IRT-based criterion scores were used in the
equating. The equipercentile method was the best method when an
equipercentile criterion score was used. This indicates some
degree of bias in the criterion (the authors also indicate that
the horizontal equating criterion score may have favored the
Rasch model). Such criterion bias would present a serious
problem in interpreting the results.

The results of this large study clearly indicate the
superiority of IRT methods in horizontal equating where samples
are not randomly chosen. In practice, that is usually the case.
For vertical equating, the Rasch model produced a large total
error, a finding which is consistent with the Slinde & Linn
(1978, 1979) and Loyd & Hoover (1980) studies. Findings for the
equipercentile approaches were puzzling. This approach worked
fairly well using equipercentile criterion scores. This
conflicts with Slinde & Linn (1977), who showed poor results with
that approach. It should be noted, however, that entirely
different test batteries were used in the two studies. As will
be discussed later, this is potentially a critical factor in
comparing equating studies.

The three parameter model was far superior to the one
parameter model for vertical equating in terms of total error, no
matter which criterion was used. This is not surprising since
the SAT Verbal items are known to be fairly difficult. The
degree of guessing that could result would seem to suggest that
estimating lower asymptotes of item characteristic curves will
reduce total error. Interestingly, the one parameter model
showed less squared bias than the three parameter model when
equating was done between easy and medium difficulty tests and
considerably more squared bias when the equating was done between
medium and hard tests. Between easy and hard tests, the three
parameter model showed less squared bias and total error.
Perhaps, with the easier tests, less guessing occurred, and the
one parameter model therefore provided closer fit to the data
than it would with the more difficult tests. This conclusion
would be supported by the Rasch model research reported earlier
(Guskey, 1981) where the effect of guessing was minimized.
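
The role of the lower asymptote can be illustrated with a short
sketch of the two item characteristic curves. This is only an
illustration under stated assumptions, not a computation from any
of the studies reviewed: the function names, the item parameter
values, and the conventional scaling constant D = 1.7 are all
assumptions introduced here.

```python
import math


def p_3pl(theta, a, b, c, D=1.7):
    """Three parameter logistic model: a = discrimination,
    b = difficulty, c = lower asymptote. D is a scaling
    constant (1.7 is a common convention, assumed here)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def p_rasch(theta, b):
    """Rasch (one parameter) model: equal discrimination for
    all items and no lower asymptote."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Hypothetical case: a low-ability examinee meets a hard item.
low_theta, hard_b = -2.0, 2.0
print(p_rasch(low_theta, hard_b))                # close to zero
print(p_3pl(low_theta, a=1.0, b=hard_b, c=0.2))  # stays above 0.2
```

On a hard item the Rasch curve predicts almost no correct
responses from low-ability examinees, while the three parameter
curve never drops below c. If guessing does occur on difficult
items, this is the kind of misfit the lower asymptote is meant to
absorb.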

In another recent study, Kolen (1981) expanded on Marco et
al. by studying a number of IRT models as well as a linear and
equipercentile method. Kolen equated vocabulary and quantitative
thinking items separately in two editions of the Iowa Tests of
Educational Development. One of the editions contained two
difficulty levels, the more difficult of which was equivalent to
the other edition. This provided both a horizontal and vertical
equating comparison. With one, two, and three parameter models,
two types of IRT equating were studied: estimated true score
equating (Lord, 1980; p. 199) and estimated observed score
equating (Lord, 1980; p. 202). In addition, a modified Rasch
model was used in which the common discrimination (slope) was
allowed to vary between the two tests being equated. Another
unique feature of the Kolen study was the use of a
cross-validation sample and statistic. The tests to be equated
had no items in common, and each test was administered to an
independent, random sample. Therefore, there was no repeated
measurement anywhere in the design. But since the samples,
including an independent cross-validation group, were randomly
assigned to their tests, the expected ability distributions were
the same. The statistic used for evaluation was a weighted mean
square error of differences in equated scores for the
cross-validation samples.
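
Estimated true score equating can be sketched in a few lines. The
sketch below follows the general idea only: find the ability at
which the expected number-right score on one form equals a given
score, then read off the expected score on the other form at that
ability. It is not Kolen's or Lord's implementation. The function
names, the bisection search, the scaling constant D = 1.7, and
the item parameters are assumptions made for illustration, and
both forms are assumed to be calibrated on a common ability
scale already.

```python
import math


def p_3pl(theta, a, b, c, D=1.7):
    """Three parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def true_score(theta, items):
    """Expected number-right score at theta.
    items is a list of (a, b, c) tuples."""
    return sum(p_3pl(theta, a, b, c) for a, b, c in items)


def theta_for_true_score(score, items, lo=-8.0, hi=8.0, tol=1e-8):
    """Invert the expected-score curve by bisection. The curve
    is increasing in theta when every a is positive. Scores
    whose theta falls outside [lo, hi] are pushed to the
    nearest bound."""
    chance = sum(c for _, _, c in items)
    if not chance < score < len(items):
        raise ValueError(
            "score must lie between the sum of the lower "
            "asymptotes and the number of items")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if true_score(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def equate_true_score(score_x, items_x, items_y):
    """True score on form Y equivalent to score_x on form X."""
    theta = theta_for_true_score(score_x, items_x)
    return true_score(theta, items_y)


# Hypothetical three-item forms; form Y is uniformly harder.
form_x = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]
form_y = [(a, b + 0.5, c) for a, b, c in form_x]
print(equate_true_score(2.0, form_x, form_y))
```

The ValueError branch reflects a known limitation of the method:
under the three parameter model, no ability corresponds to a true
score at or below the sum of the lower asymptotes, so scores
below chance level have no true score equivalent.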

The results obtained with this statistic are somewhat
confusing, suggesting a complex interaction between item content,
difficulty level, and the model. For vertical equating, the
linear and Rasch models performed poorly, supporting previous
research. Also, the three parameter model and equipercentile
models did very well. Of the two IRT equating methods, the
estimated observed score method was slightly better than the
estimated true score method. For horizontal equating, the
results were more inconsistent. There was a large difference
between vocabulary and quantitative items and between supposedly
alternate forms. In general, the estimated true score method for
the three parameter model produced the best results, but the
linear model did quite well. The estimated true score Rasch
model did well for the quantitative items.

This study supports previous research on the inadequacy of
using the Rasch model for vertical equating. The Marco and Kolen
studies suggest that the three parameter model performs better in
a variety of situations. Kolen attributes the poorer showing of
the Rasch model to its failure to account for guessing. For the
three parameter model, he notes that the difficulty in obtaining
true score equivalents below chance level may have been the
reason the estimated observed score method performed relatively
better at vertical equating. Also, there may be problems with
the LOGIST program in assessing the lower asymptote parameter,
which in this study had a differential impact on the two
difficulty levels of the tests.

With regard to vertical equating, most of the research that
has been discussed has demonstrated the superiority of the
three-parameter model over the Rasch model. Failure to account
for lower asymptotes has been cited as a primary reason for this.
An interesting contradiction to this generalization can be found
in a study by Patience (1981) using the Expression Test of the
Iowa Tests of Educational Development (ITED) and samples of ninth
through twelfth graders.

In Patience's study, scale scores were obtained from the
responses of all examinees to all 63 items. These were the
criterion scores used for comparison against scores derived
through equating. The total test was divided into high, medium,
and low difficulty subtests. Item responses of eleventh graders
to the hard test, tenth graders to the middle test, and ninth
graders to the easy test, and an internal anchor test of six
overlapping items between adjacent levels formed the basis for
the equating. There were 1,000 examinees at each grade level.
Correlations between equated ability estimates were used to
compare the equipercentile, one, two, and three parameter models.
In terms of these correlations, the three parameter model was
outperformed by the other three models. Even with below chance
level scores removed, the correlation for the three parameter
model was lower than for the other methods. Patience offers
several reasons for the results: short test length, small sample
sizes, and lack of unidimensionality, the last of which would
preclude IRT equating. In addition, the correlations produced
here were based on ability estimates from the 25 items of
moderate difficulty for each sample. These estimates were
correlated with ability estimates based on 63 items of which the
previous 25 were a subset. For ninth graders, the total test was
more difficult than the subtest. For eleventh graders, the total
test was easier. A spuriousness to the correlations could have
been introduced by the overlap of items on which each ability
estimate was based. There also could have been an effect due to
grade level. It would be interesting to recalculate the
correlations within each grade level separately and with
overlapping items from the total test removed.






Several studies have been done that are concerned with
comparing equating methods (Cook, Dunbar, & Eignor, 1981; Cowell,
1981; Kolen & Whitney, 1981; Petersen, Cook, & Stocking, 1981).
Each has focused on different aspects of the equating problem.

Small sample size was suggested by Patience to be a
contributing factor in the poor performance of the three
parameter model because the estimation procedure had difficulty
converging. Sample size as an independent variable was studied
by Cowell (1981) with the Test of English as a Foreign Language
(TOEFL). Cowell compared several IRT models with large samples
(2,000-3,000) and small samples (about 300). In equating
alternate forms of the TOEFL, differences between equating
methods and sample sizes were quite small. The tests were
probably very similar in difficulty and samples were probably
equal in ability. In this study, stable three parameter
estimates were produced by the small samples. Scores derived
with the linear model were used as the criterion scores.
Discrepancies resulting from using small as opposed to large
samples were less than discrepancies resulting from using the
one as opposed to the three parameter model. This finding was
clouded somewhat by using the linear model as the criterion, so
that discrepancies represented agreement between the models
rather than adequacy of the equating.

On the other hand, a study by Kolen & Whitney (1981) using
the General Educational Development Tests (GED) found that with
small samples (170-198), a number of extreme item parameter
estimates were produced with the three parameter model. This
suggests problems with the estimation procedure that contributed
significantly to equating error, a finding consistent with
Patience's results. This study involved a horizontal equating of
alternate GED test forms. It seems likely in these studies that
the data fit the three parameter model to varying degrees.
Patience, for example, decided initially to eliminate 12 of his
original 75 items due to nonconvergence of item parameter
estimates.

One source of error in the equating process for IRT methods
is the translation of item difficulty (and ability) estimates for
different sets of items to the same scale. Theoretically, this
can be accomplished through a linear transformation involving the
means and standard deviations of item difficulties. Petersen,
Cook, & Stocking (1981) compared three translation procedures for
the three parameter model with six editions of the SAT Verbal and
Mathematics Tests. Three linear and one equipercentile method
were also compared. The design of this study was unusual.
An original SAT was equated through five intervening forms to
itself. Each pairwise equating was done through an anchor test
design of overlapping items. The initial test thus served as its
own criterion. One of the translation procedures, called
concurrent calibration, was a simultaneous estimation using all
item responses from each pair of tests (including the anchor).
For the other two methods, called partial pre-calibration,
calibration was made separately for each test/anchor
combination. In a "fixed b's" procedure, item difficulties at
one step were calibrated. For overlapping items, item
difficulties were fixed for the next calibration, thus forcing
all tests to be placed on the same scale. For an "equated b's"
procedure, each test/anchor combination was calibrated separately
and then linked through a sequential linear transformation.
Along with appropriate graphical presentations, Petersen et al.
used a weighted mean square difference (total error) and mean
difference squared (squared bias) as summary indices (the same
statistics were used by Marco et al., 1979).
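
A linear transformation based on the means and standard
deviations of item difficulties can be sketched as follows. This
is one common form of such a transformation (often called the
mean/sigma approach); whether Petersen et al. computed their
"equated b's" in exactly this way is not stated here, so the
function names, the use of the anchor items only, and the
numerical values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def mean_sigma_constants(b_source, b_target):
    """Slope A and intercept B that carry difficulties of the
    common (anchor) items from the source calibration onto
    the target scale: b_target is approximately A * b_source + B."""
    A = pstdev(b_target) / pstdev(b_source)
    B = mean(b_target) - A * mean(b_source)
    return A, B


def rescale_item(a, b, c, A, B):
    """Put one item's three parameter estimates on the target
    scale. Difficulty is transformed linearly, discrimination
    is divided by the slope, and the lower asymptote is left
    unchanged."""
    return a / A, A * b + B, c


def rescale_ability(theta, A, B):
    """Ability estimates transform the same way as difficulties."""
    return A * theta + B


# Hypothetical anchor-item difficulties from two separate calibrations.
b_old = [-1.0, 0.0, 1.0, 2.0]
b_new = [2.0 * b + 0.5 for b in b_old]
A, B = mean_sigma_constants(b_old, b_new)
print(A, B)
print(rescale_item(1.0, 1.0, 0.2, A, B))
```

Chaining such transformations from form to form gives a
sequential linking of the kind described above. Any error in A
or B at one step is carried into every later step, which is one
way scale drift can accumulate.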

The results first of all showed that the Verbal and Math
tests responded quite differently to the same equating methods.
For the Verbal tests, the equated b's method was surprisingly
close in predicting initial scale scores. The other procedures
all overestimated initial scores, and the linear methods were all
fairly close to one another. The equipercentile method had the
largest total error. Moreover, the other methods were systematic
in their overestimation, while the equipercentile approach showed
a very erratic pattern of error. The total error for the
traditional methods was at least three times that for any IRT
method. These results were generally consistent with those of
Marco et al. (1979) for the SAT Verbal test. However, the
models used in each study were not identical.
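
The two summary indices used in these comparisons, total error
and squared bias, can be computed as in the sketch below. The
function name, the optional weights (for example, score
frequencies), and the numbers are assumptions for illustration.
The final identity, that total error equals squared bias plus the
variance of the differences, is an algebraic fact rather than a
finding of any one study.

```python
def equating_error_summary(equated, criterion, weights=None):
    """Weighted mean square difference (total error), squared
    mean difference (squared bias), and their difference,
    which is the variance of the differences."""
    if weights is None:
        weights = [1.0] * len(equated)
    total_w = sum(weights)
    diffs = [e - c for e, c in zip(equated, criterion)]
    mean_diff = sum(w * d for w, d in zip(weights, diffs)) / total_w
    total_error = sum(w * d * d for w, d in zip(weights, diffs)) / total_w
    squared_bias = mean_diff ** 2
    return {
        "total_error": total_error,
        "squared_bias": squared_bias,
        "variance_of_differences": total_error - squared_bias,
    }


# Hypothetical equated scores against a criterion of 10.
summary = equating_error_summary([11.0, 12.0, 13.0], [10.0, 10.0, 10.0])
print(summary)
```

A method that overestimates systematically shows a large squared
bias relative to its total error. A method with an erratic
pattern of error shows a large variance of differences even when
its bias is small.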

For the Math test, the equated b's method had a total error
that was more than three times that for any of the other methods
and more than double the scaled score standard deviation. Two of
the linear methods, Levine's Equally Reliable and Unequally
Reliable methods (see Angoff, 1971), and the concurrent
calibration method performed with the least amount of error. The
equipercentile approach again showed an erratic pattern of
errors. All methods except equated b's overestimated scores.
All IRT methods underestimated at the lower end of the ability
scale and overestimated at the upper end.

Reasons for the inconsistency between Verbal and Math tests
are not clear. There could have been a difference in the degree
of parallelism between the test forms, or perhaps a difference
in the degree of model fit. Item responses for reading
comprehension items that are grouped around a passage are
probably not locally independent. Another possibility is that
the differences could be due to random fluctuations of the tests
that happened to be chosen. However, the similarity of the
results for the Verbal test to those of Marco et al. tends to
suggest that some sort of more systematic process is underlying
test content in these cases.

In another comparison of verbal and quantitative items,
Kolen (1981) also found substantial inconsistencies between the
two types of tests with regard to the relative adequacy of the
different equating models. Kolen & Whitney (1981) found smaller
discrepancies between several different types of achievement
tests. On the other hand, they noticed some differences in the
factor structures of their tests.

The studies discussed so far have used a wide variety of
tests. How the content of the test might have directly affected
the equating results is not clear. However, underlying
differences may be important. Studies that have factor analyzed
their tests have shown that more than one substantial factor
typically exists, thus violating unidimensionality. If, for the
above or other reasons, the content of the test affects equating
outcome, as seems likely, then it becomes very difficult to
compare results across studies where different tests are used.
The recommendation for a practitioner to use the method that gives
the best results for a particular equating seems questionable
because we do not know as yet how well such results will
generalize to new samples.



MONTE CARLO METHODS

In the research discussed above, conclusions are derived
from real test data. While the data has the desirable property
of being representative of actual test results, it suffers from
several major drawbacks. The sample sizes required for a test
equating study almost necessitate the use of data sets from
large testing projects. Because of this, independent variables
such as sample sizes, test lengths, item factor structures, etc.
cannot be actively manipulated by the researcher. In every case
cited above, it is difficult to interpret results because the
superiority of a particular method could be due to sample size,
poor data fit to a model, criterion bias, the content of the
test, multidimensionality, and many other factors. In most real
data cases, one cannot unconfound the influence of these factors.

Monte Carlo research offers the possibility of being able to
manipulate independent variables in an experimental fashion. The
major drawback to such methods is of course simulating data that
is realistic. In different terms, simulation can provide answers
to many questions of internal validity with regard to equating
research. At the same time, its major limitation is providing
enough external validity, or generalizability. Relative to
empirical research, Monte Carlo studies have received little
attention in the IRT literature. Most of the work has dealt with
parameter estimation and robustness of models. To the authors'
knowledge, no significant simulation of a test equating has been
done. Still, some of the research has implications for equating.
Curry, Bashaw, & Rentz (1978) investigated the robustness of
Rasch ability estimates when the equal discrimination condition
was violated in a number of ways. They studied differences in
the shape of the ability distribution, the difficulty of the test
relative to the sample, the percentage of items fitting the Rasch
model, and the degree of misfit for items not fitting the model.
Misfit was described in terms of item discriminations unequal to
the average discrimination. In comparing the estimated ability
for each examinee with the true value, the error associated with
a data set that perfectly fits the Rasch model was used as a
yardstick.

The results suggested that estimated abilities were fairly
close to their original value in most situations of misfit,
relative to the calculated minimum standard error of measurement
for an ability estimate. This would seem to indicate that the
Rasch model is robust with regard to unequal discrimination. On
the other hand, an ANOVA using absolute differences as a
dependent variable produced unexpected results. For tests of
appropriate difficulty for the sample, the mean absolute
difference between estimated and true ability increased as the
percentage of fitting items increased. The authors could offer no
explanation of these findings, nor did they attempt to assess the
significance of this trend.

Several issues must be kept in mind when viewing these
results. First, the authors attempted to make their simulation
as realistic as possible by basing their choice of levels for
each independent variable on that found in real test data (Rentz
& Bashaw, 1977). This was commendable. On the other hand, the
data was generated from the two parameter logistic model, thereby
not including any effect due to guessing. This was further
insured by a zero correlation between discrimination and
difficulty parameters. In a recent study, Yen (1981) has shown
through simulation that the relationship between sets of ability
estimates from two latent trait models depends largely on the
generating model. This suggests that perhaps different results
would have been obtained had data been generated from a three
parameter model in which lower asymptote values could be fixed.

In the Curry study, the average discrimination for all tests
was held constant. In terms of realistic test equating, subsets
of items rarely have equal average item discriminations. Thus,
even when difficulty parameters are on the same scale, the
ability estimates may not be. For example, if the test is
appropriate in difficulty for the sample, ability estimates will
increase more rapidly in the middle range for a more highly
discriminating test. Therefore, under such circumstances,
unequal discrimination could affect Rasch ability estimates.
Curry suggested this problem could be overcome by rescaling
ability with the average item discrimination. However, Divgi
(1981) provided evidence that a more highly discriminating test
will show a higher ability estimate at the upper end of the
distribution and a lower estimate at the lower end than will a
test with a lower discrimination. Such a systematic error would
not be remedied by a proportional constant across the entire
scale. In his study, Kolen (1981) investigated a modified Rasch
model in which the two tests being equated were allowed to have
different average discriminations. However, this procedure did
not consistently improve equating results.

Failure to account for guessing has been discussed as a
reason for the inadequacy of Rasch vertical equating. Evidence
for this is cited in a negative correlation between difficulty
and discrimination parameter estimates. (To compute this
correlation requires the estimation of discrimination
parameters.) One simulation has been attempted by Gustafsson
(1980) to assess the impact of difficulty-discrimination
correlations on ability estimation. Bias was measured in this
study in terms of ability estimates within the same group versus
estimates derived from another group and applied to the first
group. This represented a replication of the methods used by
Slinde & Linn (1978, 1979). The results showed that when
difficulty and discrimination were uncorrelated, mean ability
estimates from high and low ability groups were very similar.
However, for a positive or negative correlation, a considerable
bias resulted. For a negative correlation, higher ability
estimates resulted from estimation within the group whose ability
matched the test's difficulty level. That is, for example, for
easy items, a higher ability estimate would be obtained from
parameters estimated by the low ability group. Such a bias is in
the same direction as that reported by Slinde & Linn. These
results suggest that guessing was a major factor in the poor
Rasch vertical equating for those studies.

Several methodological improvements could be made for future
work. For IRT research, a data matrix is usually generated by
comparing probabilities of success for each item (these are based
on the IRT model) with a uniformly distributed random number. It
seems reasonable to use a set of parameters that reflect perfect
fit to some model to obtain an estimate of the amount of random
error in the simulation process. However, it might be possible
that the expected distribution of errors for such a process could
be derived. To date, the authors know of no one who has
attempted to do this.
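
The data generation step described above can be sketched as
follows. This is a minimal illustration under stated assumptions,
not the procedure of any particular study: the three parameter
logistic form, the scaling constant 1.7, the function name, the
seed, and the parameter values are all assumptions introduced
here.

```python
import math
import random


def simulate_responses(thetas, items, seed=0):
    """Build a 0/1 data matrix (examinees by items). For each
    examinee and item, the model probability of success is
    compared with a uniform random number on [0, 1); the
    response is scored correct when the number falls below
    the probability. items is a list of (a, b, c) tuples."""
    rng = random.Random(seed)
    data = []
    for theta in thetas:
        row = []
        for a, b, c in items:
            p = c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data


# Hypothetical design: 2,000 examinees of average ability,
# one very easy item and one very hard item, no guessing.
thetas = [0.0] * 2000
items = [(1.0, -3.0, 0.0), (1.0, 3.0, 0.0)]
data = simulate_responses(thetas, items, seed=1)
p_easy = sum(row[0] for row in data) / len(data)
p_hard = sum(row[1] for row in data) / len(data)
print(p_easy, p_hard)
```

Because the generating parameters fit the model perfectly, any
discrepancy between the observed proportions and the model
probabilities is random error from the simulation process itself,
which is the yardstick the paragraph above has in mind.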

These few simulations have barely scratched the surface of
what needs to be known about test equating. Monte Carlo studies
are typically costly and difficult to run, and so there has
probably been a purely economic reluctance to conduct such
research. On the other hand, the active manipulation of
independent variables is something that cannot be accomplished
with empirical studies, except on a post hoc basis. The scant
research thus far has suggested that the Rasch model is sometimes
robust to ability differences in parameter estimation and
sometimes not.

Some of the issues that could be dealt with through
simulations include:

1) How does unequal average discrimination affect equating error?

2) How do various types of content or multidimensional fit affect
equating error?

3) Can the effects due to shifts in population distributions of
ability be separated from equating error?

4) What is the effect of differential reliability of the two
tests on equating results?

One methodological consideration to be made by the
researcher is to decide on the model from which the data will be
generated. For the Rasch model, the two or three parameter
logistic model could be used, but the researcher faces the issue
of whether or not these models adequately represent test data in
real life. What then becomes the focus of study is the extent of
agreement between two models, one of which contains fewer
parameters than the other. That can be useful, depending on how
realistic the fuller model is. For research on the three
parameter model, a major problem arises, namely, what should the
generating model be? If a four parameter model is used, what
should be the additional parameter? The same holds true for
dimensionality studies, although perhaps some bootstrap
approaches might be improvised (e.g., using factor scores). These
methodological concerns have not yet been addressed.

Despite these problems, Monte Carlo methods hold a great
deal of promise for assessing some of the test equating problems
that cannot be addressed through empirical studies. Moreover,
the authors believe that until such work is completed, further
work with existing data sets will not be very useful.



ISSUES RELEVANT TO TEST EQUATING

In reviewing test equating research, many more questions
have been raised than have been answered. The results clearly
indicate that no single method is superior to the others in all
contexts. Because of this, research needs to be broadened to
include specific aspects of the equating problem. One thing that
has become clear is that research which at the time seemed to
support one model or another (in this case usually Rasch) can be
challenged in light of what has been done since then.

In this final section, we shall discuss several issues that
are pertinent to test equating, but which have received little
attention thus far. We shall also try to provide directions for
future research and summarize the conclusions reached in this
paper.

Assessing adequacy of equating

In the studies that have been discussed, a number of methods
have been introduced for evaluating how well equating procedures
performed. In many cases, the choice of method was guided by
limitations in the design of the study. Still, some comments are
in order because the conclusions reached in a study are
influenced by the manner in which the results are evaluated.

For example, Wright's standardized difference index has been
used frequently. A mean of zero and standard deviation of one is
a necessary but not sufficient condition for item-free person
measurement. Because the index is a summary statistic, systematic
differences in ability estimates can average out to a mean
difference of zero. Divgi (1981) demonstrates how this can occur by
using a residual plot across the raw score scale. In his
example, using data from the Metropolitan Reading Test, he showed
that a difficult subtest yielded higher ability estimates at the
extremes and an easy subtest yielded higher ability estimates in
the middle of the raw score distribution. Yet, the mean and
standard deviation of standardized differences were *.024 and 
1.21, respectively. How well this example generalizes remains to
be seen. On the other hand, Whitely & Dawis (1974) reported
values similar to those of Divgi's example. It would have been
helpful in this and other studies to have seen standardized
differences plotted as a function of the raw score or ability
scale.
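
The way a near-zero mean can hide systematic differences can be illustrated with a minimal Python sketch. It assumes the commonly used form of the standardized difference (the difference between two ability estimates divided by the square root of the sum of their squared standard errors). The data, the standard errors, and the function name `standardized_differences` are hypothetical and are not taken from Divgi (1981).

```python
import math
import statistics


def standardized_differences(theta_a, theta_b, se_a, se_b):
    """Return (theta_a - theta_b) / sqrt(se_a^2 + se_b^2), element by element."""
    return [
        (a - b) / math.sqrt(sa ** 2 + sb ** 2)
        for a, b, sa, sb in zip(theta_a, theta_b, se_a, se_b)
    ]


# Hypothetical ability estimates at five raw-score levels, low to high.
theta_hard = [-2.5, -2.0, -1.0, 0.0, 5.5]  # from the difficult subtest
theta_easy = [-4.0, -1.0, 0.0, 1.0, 4.0]   # from the easy subtest
se_hard = [0.6] * 5                        # assumed standard errors
se_easy = [0.8] * 5

z = standardized_differences(theta_hard, theta_easy, se_hard, se_easy)
mean_z = statistics.fmean(z)   # summary statistic: close to zero
sd_z = statistics.pstdev(z)    # summary statistic: not far from one

# The summary statistics look acceptable, but listing z by score level
# shows the hard subtest is higher at both extremes and lower in the middle.
for level, value in enumerate(z, start=1):
    print(f"score level {level}: z = {value:+.2f}")
print(f"mean = {mean_z:.3f}, sd = {sd_z:.3f}")
```

Printing or plotting the values by score level, rather than reporting only the mean and standard deviation, is what exposes the pattern.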

Much the same could be said of correlating sets of ability
or difficulty estimates. As Tinsley & Dawis (1975) have shown,
ability estimates can correlate almost perfectly even though
difficulty estimates correlate negatively or not at all.
Correlations can also be influenced by extreme cases, and they
are fairly immune to differences between variances of two sets of
estimates. Because of these problems, correlations are at
best only a crude estimate of invariance. One does not know how
high correlations should be. Most studies have subjectively
appraised their values, but it could turn out that .95 is
substantially lower than one would predict from a particular
model.
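
A small Python sketch shows why a correlation says little about whether two sets of estimates agree. The numbers are hypothetical: the second set of ability estimates is an exact linear function of the first, with twice the spread and a shifted mean, so the correlation is perfect even though no examinee receives the same estimate from both tests.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical ability estimates for five examinees on two tests.
ability_test_1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
ability_test_2 = [2.0 * t + 0.5 for t in ability_test_1]  # doubled spread, shifted

r = pearson_r(ability_test_1, ability_test_2)
largest_gap = max(abs(a - b) for a, b in zip(ability_test_1, ability_test_2))

print(f"r = {r:.3f}, largest disagreement = {largest_gap:.1f} logits")
```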

Many studies, particularly those comparing several IRT
models, have used a mean square error concept as a summary
statistic. This is a traditional concept in assessing
measurement error. With respect to test equating, comparisons
between two sets of ability estimates or estimated true or
observed scores imply that one of the sets is the criterion, or
true, distribution. The total mean square can be broken down
into bias and imprecision indices. The major problem with this
approach is that both sets of estimates are based on fallible
scores and neither one is truly a criterion measure. The problem
of the criterion has never been solved. This is largely a
theoretical issue. According to Lord (1980, Theorem 13.3.1),
tests cannot be strictly equated unless they are perfectly
reliable or strictly parallel. Because this is almost never
true, tests can only be equated in a tau-equivalency sense (see
Whitely & Dawis, 1974, p. 170), that is, in terms of expected
scores on two tests. In the empirical studies reviewed here,
error of equating is confounded with person measurement error.
To the authors' knowledge, no study has yet been done to try to
separate test unreliability from equating error.
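
The breakdown of the total mean square into bias and imprecision can be sketched as follows. This is a minimal illustration, assuming the usual identity in which the mean squared difference equals the squared mean difference (bias) plus the variance of the differences (imprecision); the individual studies may have defined their indices somewhat differently. The scores and the name `mse_decomposition` are hypothetical.

```python
def mse_decomposition(equated, criterion):
    """Return (total mean square, bias, imprecision) of equated vs. criterion scores."""
    n = len(equated)
    diffs = [e - c for e, c in zip(equated, criterion)]
    bias = sum(diffs) / n                                  # mean difference
    imprecision = sum((d - bias) ** 2 for d in diffs) / n  # variance of differences
    total = sum(d ** 2 for d in diffs) / n                 # mean squared difference
    return total, bias, imprecision


# Hypothetical equated scores and "criterion" scores for five examinees.
equated = [10.5, 12.0, 15.5, 18.0, 21.0]
criterion = [10.0, 12.5, 15.0, 17.0, 20.0]

total, bias, imprecision = mse_decomposition(equated, criterion)

# total equals bias**2 + imprecision (up to floating-point rounding).
print(f"total = {total:.3f}, bias^2 = {bias ** 2:.3f}, imprecision = {imprecision:.3f}")
```

The arithmetic holds regardless of which set of scores is labeled the criterion, which is exactly the difficulty: the decomposition cannot tell which set is closer to the truth.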

A variance approach may also contain biases that can affect
a measure of how well certain models perform. Kolen & Whitney
(1981), for example, noted that a relatively large variance of
converted equating scores will result in a relatively large value
of imprecision and hence of total error. The authors believe
that this was a reason for a higher value for the three parameter
model than for others. Conversely, the possibility exists for a
model to look better than it really is, simply because of a small
variance of equated scores.

Another aspect of this problem is bias in the criterion
score itself. This score could be an estimated true score,
ability score, scaled score, or a raw score. Marco, Petersen, &
Stewart (1979) addressed criterion bias in their study. For
horizontal equating, a test was equated to itself, with the test
score serving as the criterion. The authors felt that this
procedure may have favored the Rasch model because more ICC
parameters were fixed at constant values than for other IRT
models. In the vertical equating portion of their study, the
criterion scores had to be calculated. It turned out that the
mean square errors for the various models were considerably
dependent on the method used to calculate the criterion scores.
In other studies (Cowell, 1981; Cook, Dunbar, & Eignor, 1981),
equated scores from one of the models were used as the criterion.
In these cases, mean square error became a measure of agreement,
and so it was impossible to tell if any method worked well at
all.

What is to be done in light of such conflicting information?
Obviously, some research needs to be done on the matter of
criterion bias. Also, we need to know more about equating from a
distribution theory point of view. That is, what is the
distribution of error from IRT equating methods? For the
present, the authors recommend that conclusions based on a single
summary statistic be considered very questionable. Multiple
assessment procedures should be utilized. An especially
important procedure is to examine differences in ability
estimates at points across the entire ability scale so that any
systematic errors may be spotted. Studies which have provided
such graphical or scatterplot techniques have been inherently
more useful.

Sources of equating error

The previous discussion has pointed out the need for more
investigation into sources of equating error. The purpose of
this is to see what systematic errors may result from the effects
of different linking procedures, differential reliability,
parameter estimation, and shrinkage.

In the Petersen, Cook, & Stocking (1981) study, vastly
different results were obtained from the three linking
procedures even though all were based on the three parameter
model. According to Lord (personal communication), perfect
correlations between difficulty estimates are not found for two
major reasons: lack of model fit and sampling fluctuations.
Lord points out that the latter probably predominates in real
data sets. Most studies in the literature used a linear
transformation based on the means and standard deviations of item
difficulty estimates. For the three parameter model, such a
transformation ignores information from the discrimination and
lower asymptote parameters. Recently, methods have been proposed
which are based on minimizing the differences between item
characteristic curves (Divgi, 1980; Haebara, 1980; Stocking &
Lord, 1982). These methods are more complex, but initial results
seem encouraging.
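
The linear transformation based on means and standard deviations of item difficulty estimates can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any particular study: the difficulty values are hypothetical, the function name `mean_sigma_link` is invented for the example, and the two calibrations are constructed to differ by an exact linear transformation so that the slope and intercept are recovered without error.

```python
import statistics


def mean_sigma_link(b_from, b_to):
    """Slope and intercept placing b_from on the scale of b_to,
    using only the means and standard deviations of the common items."""
    slope = statistics.pstdev(b_to) / statistics.pstdev(b_from)
    intercept = statistics.fmean(b_to) - slope * statistics.fmean(b_from)
    return slope, intercept


# Hypothetical difficulty estimates for five common (anchor) items,
# calibrated separately in two groups.
b_group_x = [-1.5, -0.5, 0.0, 0.8, 1.7]
b_group_y = [1.2 * b + 0.5 for b in b_group_x]  # constructed: exact linear relation

slope, intercept = mean_sigma_link(b_group_x, b_group_y)
b_x_on_y_scale = [slope * b + intercept for b in b_group_x]

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

Note that only the difficulty estimates enter the calculation. With real three parameter calibrations, the discrimination and lower asymptote estimates play no part in determining the slope and intercept, which is the limitation the characteristic curve methods are meant to address.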

Another source of error in equating lies in the estimation
of parameters. Difficulties in estimation were mentioned by
Kolen (1981) and Patience (1981) as a problem with their results.
While a detailed examination of procedures is beyond the scope of
this paper, several comments can be made. One of the major
criticisms made by Wright (1977) of the three parameter model was
that discrimination and lower asymptote parameters could not be
estimated without severe restrictions being placed on the
estimates. In a simulation study, Ree (1979) noted that the
quality of parameter estimation with LOGIST and other programs
depended partly on the characteristics of the ability
distribution of the sample on which estimation was based. Work
by Reckase (1979) investigated parameter estimation in
conjunction with sample size, length of the anchor test, and the
type of linking procedure. His results suggested that for larger
sample sizes the length of the anchor test was not critical.
Larger samples in this case meant 300 or more for the Rasch model
and 1,000 or more for the three parameter model. The issue here
is really a question of which is more costly in a practical
sense: poor parameter estimation or failure to include sources of
variation in the model. It is an interesting research topic and
one that has rarely been explored directly.

Another source of equating error that has not, to the
authors' knowledge, been investigated is the effect of unequal
test reliabilities. As has been discussed previously, unequally
reliable tests cannot meet Lord's equity requirement. However,
some assessment needs to be made of this problem in order to
estimate any systematic effects that may occur in the equating
process. This seems particularly important for vertically
equated tests, where examinees' scores on in-level and
out-of-level tests are compared.

Finally, there is some concern over the generalizability of
equating results. In many of the early equating studies,
parameter estimates used for equating were derived from the same
group to whom the equating results were applied. However, in
later studies (Kolen, 1981; Loyd & Hoover, 1980; Slinde & Linn,
1978, 1979), calibration based on one group was applied to
another. For the Rasch model, the results were not quite as
encouraging. The situation is similar to that of shrinkage for
multiple regression. The best recommendation to account for
this, made by Kolen (1981) and Kolen & Whitney (1981), is to use
a cross-validation sample whenever possible.

Multidimensionality

Violation of the unidimensionality assumption has been the
least studied of several possible deviations from IRT models.
And yet, this may ultimately be the most important source of
misfit. As has been previously pointed out, multidimensionality
precludes the use of these IRT methods, but it also, in a more
general sense, precludes equating altogether. Consider, for
instance, changes in curriculum content across grade
levels. Does it make sense to equate two test levels when their
content differs? In some cases, it may not be meaningful at all
to equate vertically.

Factor analysis is the most frequently used method of
assessing unidimensionality. This is usually obtained with a
non-rotated principal factors solution with estimated
communalities in the diagonal of the inter-item correlation
matrix. Regardless of problems with the procedure,
interpretation is problematic. Typically, one wishes to account
for as much variance as possible with the first factor.
Unfortunately, this first factor, frequently labeled a general
ability factor, is not necessarily the trait that was supposed to
be measured by the test, that is, reading comprehension,
vocabulary, etc. If the first factor accounts for, say, 75% of the
variance on a math test, this does not in any way indicate that
the test is unidimensional and that dimension is math. The trait
called math could only be extracted through rotating the
solution, a procedure which tends to spread out variances across
factors.
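
The weakness of the first-factor criterion can be illustrated with a short Python sketch. It is a simplification of the procedure described above: it uses a principal components solution (unities in the diagonal) rather than principal factors with estimated communalities, and it finds the largest eigenvalue by power iteration. The four-item correlation matrix is hypothetical and was built to contain two distinct item clusters.

```python
import math


def largest_eigenvalue(matrix, iterations=500):
    """Largest eigenvalue of a symmetric positive definite matrix by power iteration."""
    n = len(matrix)
    v = [1.0 / (i + 1) for i in range(n)]  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged unit vector.
    w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(vi * wi for vi, wi in zip(v, w))


# Hypothetical inter-item correlations: items 1-2 and items 3-4 form two
# clusters (r = .8 within a cluster, r = .4 between clusters).
R = [
    [1.0, 0.8, 0.4, 0.4],
    [0.8, 1.0, 0.4, 0.4],
    [0.4, 0.4, 1.0, 0.8],
    [0.4, 0.4, 0.8, 1.0],
]

first = largest_eigenvalue(R)
proportion = first / len(R)  # share of total variance on the first component

print(f"first component accounts for {proportion:.0%} of the variance")
```

Even though the items were constructed around two clusters, the first unrotated component still absorbs well over half of the total variance, so a large first factor by itself is weak evidence of unidimensionality.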

A major reason why multidimensionality has not been
investigated is that it is an immensely more difficult issue to
study. Multidimensional latent trait models have not yet been
consistently theorized nor have parameter estimation procedures
been developed, although examples of two dimensional models can
be found in Lumsden (1978) and Goldstein (1980).

An examination of empirical equating studies shows why this
issue deserves more attention. Many of the studies (e.g., Kolen,
1981; Petersen, Cook, & Stocking, 1981) found considerable
difference in the adequacy of various equating methods for
different types of tests. As yet, we do not know what
characteristics of these tests cause the differences. It
certainly makes comparisons across different studies difficult if
not impossible. The best that can be recommended to the
practitioner is to select the method that works best for their
particular test. That is a weak recommendation because one does
not know how consistent the results will be on cross-validation
and because of the problems of defining the criterion score.

The lack of knowledge about multidimensionality is a major
obstacle to interpreting equating results. This is not meant to
be a criticism of previous research but an illustration of how
new the field of IRT equating is and how far it needs to go
before definitive answers can be attained.

Out-of-level Testing 

A growing literature that has been relatively independent of
the test equating studies is the research on out-of-level
testing. Out-of-level testing concerns the testing of examinees
at a level of a test battery other than the one that would
usually be assigned to them according to grade or age. Several
studies in recent years (Ayrer & McNamara, 1973; Long, Schaffran,
& Kellogg, 1977; Ozenne, 1979) have convincingly demonstrated
that substantially different grade-equivalent scores result from
in-level versus out-of-level testing of examinees. Most of this
research has dealt with students functioning at a level below
their grade level. Testing such examinees at a level appropriate
for their skills allows for a much better assessment from a
diagnostic and instructional point of view. However, no one is
sure how to place scores on the scale of the in-level test.

Vertical equating offers a possible solution. The work of
Slinde & Linn (1977) suggested that IRT methodology might provide
a better solution than traditional approaches. However, test
equating research has paid little attention so far to the
literature on out-of-level testing. There is virtually no
cross-referencing between the two literatures. A result of this
is that test equating research has not addressed the problem of
interpreting out-of-level test scores. Of the studies reviewed
here, only one (Guskey, 1981) has compared the latent trait
ability scale to the grade-equivalent scale. Another possible
solution is out-of-level norms. To the authors' knowledge, no
one has compared such norms to the latent ability scale.

School systems today are facing the problem of appropriate
test levels. To date, the IRT literature has not focused on the
problem directly, and a real need exists for further study.



Practical implications of results 

For test users, the results of IRT equating research thus
far must seem quite confusing. The field has simply not
developed to the point that very many conclusions can be reached.
In other words, many questions of critical concern have not been
fully answered by the research thus far. These include:

1) Whether IRT methods provide better equating results than
traditional methods.

2) Whether it is better to develop out-of-level testing scores
through vertical equating or separate norms.

3) Whether latent trait ability estimates provide a more valid
measurement of ability than grade-equivalent or other standard
scores.

In terms of future research, further empirical studies with
one test or another are not likely to be useful. Some work with
Monte Carlo procedures would be very helpful in terms of
examining potential influences on equating results. In addition,
there will probably be a trend toward assessing specific sources
of equating error such as parameter estimation, linking
procedures, and test reliability. Finally, some theoretical work
is needed with the distribution theory of equated scores.

Recommendations for practical applications of test equating
based on our review of the literature can be summarized as
follows:

For horizontal equating,

1) No single method is consistently superior to the others.

2) If the data are reasonably reliable, tests are nearly equal in
difficulty, and samples are nearly equal in ability, probably any
method will achieve satisfactory results.

3) The Rasch model should not be used where a substantial amount
of guessing has occurred.

4) The three parameter model should not be used with small sample
sizes (less than 1,000).

5) Population changes should be investigated if the equatings
take place over a long period of time.

For vertical equating,

1) The Rasch model should not be used at all unless test
difficulty differences are small and guessing is minimized.

2) The three parameter model has not been proven to be superior
to the Rasch model: very little work has been done with it in
vertical equating.

3) In terms of content differences, it may not be meaningful at
all to equate vertically.

4) When a vertical equating must be done, the safest procedure at
this time would probably be to use an equipercentile approach.

In general,

1) Results should be cross-validated whenever possible.

2) Multiple procedures should be used to evaluate equating
results. These include both summary statistics and graphical
presentations.






Anderson, J., Kearney, G.E., & Everett, A.V. An evaluation of
Rasch's structural model for test items. British Journal
of Mathematical and Statistical Psychology.



Angoff, W.H. Scales, norms, and equivalent scores. In R.L.
Thorndike (Ed.), Educational Measurement (2nd ed.).
Washington, D.C.: American Council on Education, 1971.



Ayrer, J.E. & McNamara, T.C. Survey testing on an out-of-level
basis. Journal of Educational Measurement, 1973, 10, 79-84.



Birnbaum, A. Some latent trait models and their use in inferring
an examinee's ability. In F.M. Lord & M.R. Novick,
Statistical theories of mental test scores. Reading,
Mass.: Addison-Wesley, 1968.



Cook, L.L., Dunbar, S.B., & Eignor, D.R. IRT equating: A
flexible alternative to conventional methods for solving
practical testing problems. Paper presented at annual
meeting of American Educational Research Association, Los
Angeles, 1981.



Cook, L.L. & Eignor, D.R. Score equating and item response
theory: Some practical considerations. Paper presented at
annual meeting of National Council on Measurement in
Education, Los Angeles, 1981.



Cowell, W. Applicability of a simplified three parameter
logistic model for equating tests. Paper presented at
annual meeting of American Educational Research
Association, Los Angeles, 1981.



Curry, A.D., Bashaw, W.L., & Rentz, R.R. Invariance of Rasch
model ability parameter estimates over different
collections of items. Paper presented at annual meeting
of American Educational Research Association, Toronto,
1978.



Divgi, D.R. Evaluation of scales for multi-level test batteries.
Paper presented at annual meeting of American Educational
Research Association, Boston, 1980.






Divgi, D.R. Does the Rasch model really work? Not if you look
closely. Paper presented at annual meeting of American
Educational Research Association, Los Angeles, 1981.



Goldstein, H. Dimensionality, bias, independence, and
measurement scale problems in latent trait score models.
British Journal of Mathematical and Statistical
Psychology, 1980, 33, 234-246.



Guskey, T.R. Comparison of a Rasch model scale and the
grade-equivalent scale for vertical equating of test
scores. Applied Psychological Measurement,
1981, 5, 187-201.



Gustafsson, J.E. The Rasch model in vertical equating of tests:
A critique of Slinde and Linn. Journal of Educational
Measurement, 1979, 16, 153-158.



Gustafsson, J.E. Testing and obtaining fit of data to the Rasch
model. British Journal of Mathematical and Statistical
Psychology, 1980, 33, 205-233.



Haebara, T. Equating logistic ability scales by a weighted least
squares. Japanese Psychological Research,
1980, 22, 144-149.



Hambleton, R.K. & Cook, L.L. Latent trait models and their use in
the analysis of educational test data. Journal of
Educational Measurement, 1977, 14, 75-96.



Hambleton, R.K., Swaminathan, H., Cook, L.L., Eignor, D.R., &
Gifford, J.A. Developments in latent trait theory: Models,
technical issues, and applications. Review of Educational
Research, 1978, 48, 467-510.



Kolen, M.J. Comparison of traditional and item response theory
methods for equating tests. Journal of Educational
Measurement, 1981, 18, 1-11.



Kolen, M.J. & Whitney, D.R. Comparison of four procedures for
equating the Tests of General Educational Development.
Paper presented at annual meeting of American Educational
Research Association, Los Angeles, 1981.






Long, J.V., Schaffran, J.A., & Kellogg, T.M. Effects of
out-of-level survey testing on reading achievement scores
of Title I, ESEA students. Journal of Educational
Measurement, 1977, 14, 203-213.



Lord, F.M. Practical applications of item characteristic curve
theory. Journal of Educational Measurement,
1977, 14, 117-138.



Lord, F.M. Practical applications of item response theory.
Hillsdale, N.J.: Lawrence Erlbaum, 1980.



Loret, P.G., Seder, A., Bianchini, J.C., & Vale, C. Anchor test
study final report: Project report and volumes 1 through
30. Berkeley, Calif.: Educational Testing Service, 1974.



Loyd, B.H. & Hoover, H.D. Vertical equating using the Rasch
model. Journal of Educational Measurement,
1980, 17, 179-193.



Lumsden, J. Tests are perfectly reliable. British Journal of
Mathematical and Statistical Psychology, 1978, 31, 19-26.



Marco, G.L. Item characteristic curve solutions to three
intractable testing problems. Journal of Educational
Measurement, 1977, 14, 139-160.



Marco, G.L., Petersen, N.S., & Stewart, E.E. A test of the adequacy
of curvilinear score equating models. Paper presented at
the 1979 Computerized Adaptive Testing Conference,
Minneapolis, 1979.



Ozenne, D. You may not get what you think you are getting: What
test scores mean in out-of-level testing. Paper presented
at annual meeting of American Educational Research
Association, San Francisco, 1979.

Patience, W. A comparison of latent trait and equipercentile
methods of vertically equating tests. Paper presented at
annual meeting of National Council on Measurement in
Education, Los Angeles, 1981.






Petersen, N.S., Cook, L.L., & Stocking, M.L. IRT versus
conventional equating methods: A comparative study of
scale drift. Paper presented at annual meeting of
American Educational Research Association, Los Angeles,
1981.



Rasch, G. Probabilistic models for some intelligence and
attainment tests. Copenhagen: Danish Institute for
Educational Research, 1960.



Reckase, M.D. Item pool construction for use with latent trait
models. Paper presented at annual meeting of American
Educational Research Association, San Francisco, 1979.



Ree, M.J. Estimating item characteristic curves. Applied
Psychological Measurement, 1979, 3, 371-385.



Rentz, R.R. Monitoring the quality of an item pool calibrated by
the Rasch model. Paper presented at the annual meeting of
National Council on Measurement in Education, Toronto,
1978.



Rentz, R.R. & Bashaw, W.L. The National Reference Scale for
reading: An application of the Rasch model. Journal of
Educational Measurement, 1977, 14, 161-179.



Slinde, J.A. & Linn, R.L. Vertically equated tests: Fact or
phantom? Journal of Educational Measurement,
1977, 14, 23-32.



Slinde, J.A. & Linn, R.L. An exploration of the adequacy of the
Rasch model for the problem of vertical equating. Journal
of Educational Measurement, 1978, 15, 23-35.



Slinde, J.A. & Linn, R.L. A note on vertical equating via the
Rasch model for groups of quite different ability and
tests of quite different difficulty. Journal of
Educational Measurement, 1979, 16, 159-165.



Stocking, M.L. & Lord, F.M. Developing a common metric in item
response theory. Unpublished manuscript, 1982.



Tinsley, H.E. & Dawis, R.V. An investigation of the Rasch simple
logistic model: Sample-free item and test calibration.
Educational and Psychological Measurement,
1975, 35, 325-339.







Whitely, S.E. Models, meanings, and misunderstandings: Some
issues in applying Rasch's theory. Journal of Educational
Measurement, 1977, 14, 227-236.



Whitely, S.E. & Dawis, R.V. The nature of objectivity with the
Rasch model. Journal of Educational Measurement,
1974, 11, 163-178.



Wood, R.L., Wingersky, M.S., & Lord, F.M. LOGIST: A computer
program for estimating examinee ability and item
characteristic curve parameters. RM 76-6. Princeton,
N.J.: Educational Testing Service, 1976.



Wright, B.D. Sample-free test calibration and person
measurement. In Proceedings of the 1967 Invitational
Conference on Testing Problems.
Princeton, N.J.: Educational Testing Service, 1968.



Wright, B.D. Solving measurement problems with the Rasch model.
Journal of Educational Measurement, 1977, 14, 97-116.



Wright, B.D. & Mead, R.J. BICAL: Calibrating rating scales with
the Rasch model. Research Memorandum No. 23. Chicago:
Statistical Laboratory, Department of Education,
University of Chicago, 1976.



Wright, B.D. & Panchapakesan, N. A procedure for sample-free item
analysis. Educational and Psychological Measurement,
1969, 29, 23-48.

Yen, W.M. Using simulation results to choose a latent trait
model. Applied Psychological Measurement, 1981, 5, 245-262.






